In this lab, we are going to work with vector data, which we talked about last week. First, we are going to geocode addresses. We are going to work with a dataset of cancer patients across California, as well as Census data on socioeconomic factors. We will also cover how to visualize these data.
The objectives of this guide are to teach you how to geocode addresses, work with vector data using the sf package, map point data with tmap, and bring Census data into R.
Let’s get cracking!
You hopefully remember some of this from last week in Lab 2, but let’s open an R Markdown file: click File in the top menu of RStudio, select New File, and then R Markdown…. A window should pop up. In that window, enter “Lab 3” for the title and your name for the author. Leave the HTML radio button selected, and click OK. A new R Markdown file should appear in the top left window.
Let’s load the packages we will need this week. We load any previously installed packages using the function library(). Remember: install once, load every time. If you get an error like no package called..., you need to install that package with install.packages(). When using a package, library() should always be at the top of your R Markdown.
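If any of the packages below are missing on your machine, a one-time install might look like this (a sketch — trim the vector to whichever packages you actually need):

```r
# One-time installation; after this, library() is all you need each session
install.packages(c("tidygeocoder", "sf", "MapGAM", "tidyverse",
                   "tidycensus", "flextable", "tmap"))
```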
library(tidygeocoder)
library(sf)
library(MapGAM)
library(tidyverse)
library(tidycensus)
library(flextable)
library(tmap)
First, we are going to tackle how we take addresses and convert them to spatial data (latitude and longitude). So, let’s say we wanted to map all of the marijuana dispensaries across San Francisco. Let’s download a .csv of these addresses from the Github site, then take a look at the dataset.
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/san_francisco_active_marijuana_retailers.csv", "san_francisco_active_marijuana_retailers.csv", mode = "wb")
sf_mj <- read_csv("san_francisco_active_marijuana_retailers.csv")
## Rows: 33 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): License Number, License Type, Business Owner, Business Structure, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(sf_mj)
## # A tibble: 6 × 10
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 6 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>
OK, some interesting columns there, and we have Premise Address as a column that we might want to make spatial. Let’s look closer at that.
head(sf_mj$`Premise Address`)
## [1] "2165 IRVING ST san francisco, CA 94122 County: SAN FRANCISCO"
## [2] "122 10TH ST SAN FRANCISCO, CA 941032605 County: SAN FRANCISCO"
## [3] "843 Howard ST SAN FRANCISCO, CA 94103 County: SAN FRANCISCO"
## [4] "70 SECOND ST SAN FRANCISCO, CA 94105 County: SAN FRANCISCO"
## [5] "527 Howard ST San Francisco, CA 94105 County: SAN FRANCISCO"
## [6] "2414 Lombard ST San Francisco, CA 94123 County: SAN FRANCISCO"
OK that column looks like what we want to geocode. But how do we take
these addresses and make them into spatial information? We have to
geocode them! To do so, we will use the tidygeocoder package in R. But first, notice that the addresses look a little strange: each one ends with “County: SAN FRANCISCO”, so we will strip that entire string out with gsub().
sf_mj$`Premise Address` <- gsub(" County: SAN FRANCISCO",
"", sf_mj$`Premise Address`)
head(sf_mj$`Premise Address`)
## [1] "2165 IRVING ST san francisco, CA 94122"
## [2] "122 10TH ST SAN FRANCISCO, CA 941032605"
## [3] "843 Howard ST SAN FRANCISCO, CA 94103"
## [4] "70 SECOND ST SAN FRANCISCO, CA 94105"
## [5] "527 Howard ST San Francisco, CA 94105"
## [6] "2414 Lombard ST San Francisco, CA 94123"
That looks much better.
Now let’s try geocoding these addresses with the tidygeocoder package. We will use the geocode() function to add a latitude and longitude to each of our addresses in the Premise Address column. We will use the Open Street Map address database by specifying method = "osm". This will take about a minute to run, so be patient!
sf_mj_geo <- geocode(sf_mj, "Premise Address",
method = "osm")
## Passing 33 addresses to the Nominatim single address geocoder
## Query completed in: 34.2 seconds
head(sf_mj_geo)
## # A tibble: 6 × 12
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 8 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>, lat <dbl>, long <dbl>
Hmm, looks like some of our addresses have an NA for
their lat and long. Let’s take a closer look.
summary(sf_mj_geo$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 37.71 37.75 37.78 37.77 37.78 37.80 11
summary(sf_mj_geo$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -122.5 -122.4 -122.4 -122.4 -122.4 -122.4 11
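Before switching geocoders, we might want to see exactly which addresses failed. A quick sketch using dplyr (already loaded with the tidyverse):

```r
# Pull out the rows where geocoding returned no coordinates
sf_mj_geo %>%
  filter(is.na(lat)) %>%
  select(`Premise Address`)
```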
Looks like we have 11 addresses missing lat and 11 missing long. Let’s try this again using a different geocoding database called arcgis.
sf_mj_geo_arc <- geocode(sf_mj, "Premise Address",
method = "arcgis")
## Passing 33 addresses to the ArcGIS single address geocoder
## Query completed in: 15 seconds
head(sf_mj_geo_arc)
## # A tibble: 6 × 12
## `License Number` `License Type` `Business Owner` `Business Structure`
## <chr> <chr> <chr> <chr>
## 1 C10-0000614-LIC Cannabis - Retailer Li… Terry Muller Limited Liability C…
## 2 C10-0000586-LIC Cannabis - Retailer Li… Jeremy Goodin Corporation
## 3 C10-0000587-LIC Cannabis - Retailer Li… Justin Jarin Corporation
## 4 C10-0000539-LIC Cannabis - Retailer Li… Ondyn Herschelle Corporation
## 5 C10-0000522-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## 6 C10-0000523-LIC Cannabis - Retailer Li… Ryan Hudson Limited Liability C…
## # ℹ 8 more variables: `Premise Address` <chr>, Status <chr>,
## # `Issue Date` <chr>, `Expiration Date` <chr>, Activities <chr>,
## # `Adult-Use/Medicinal` <chr>, lat <dbl>, long <dbl>
summary(sf_mj_geo_arc$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.71 37.76 37.77 37.77 37.78 37.80
summary(sf_mj_geo_arc$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -122.5 -122.4 -122.4 -122.4 -122.4 -122.4
Woohoo! No missingness. Love to see it. OK, let’s plot these data and see how they look.
plot(sf_mj_geo_arc$long, sf_mj_geo_arc$lat)

We are in business! We have taken addresses and converted them into
latitude and longitude! I think we need a badge! 
Bonus exercise! Let’s take these addresses and reverse geocode
them. That’s just a fancy way of saying that we will take latitude
and longitude data and convert it into readable addresses. We use the
aptly named function reverse_geocode and specify which
columns to look at (lat and long), the method we want
for geocoding, and what we want the address column to be named.
Then we select out some columns that we aren’t really
interested in. Remember, we are doing this the tidy way
so we are using %>% pipes.
reverse <- sf_mj_geo_arc %>%
reverse_geocode(lat = lat, long = long, method = 'arcgis',
address = address_found) %>%
select(-`Business Owner`,-`Business Structure`,-`License Number`,-`License Type`,-Status,-`Issue Date`,-`Expiration Date`,-Activities,-`Adult-Use/Medicinal`)
## Passing 33 coordinates to the ArcGIS single coordinate geocoder
## Query completed in: 15.9 seconds
head(reverse)
## # A tibble: 6 × 4
## `Premise Address` lat long address_found
## <chr> <dbl> <dbl> <chr>
## 1 2165 IRVING ST san francisco, CA 94122 37.8 -122. Smokin D's BBQ, 2181 Irvi…
## 2 122 10TH ST SAN FRANCISCO, CA 941032605 37.8 -122. Urbana Weed Dispensary SO…
## 3 843 Howard ST SAN FRANCISCO, CA 94103 37.8 -122. Comart Business Syst, 843…
## 4 70 SECOND ST SAN FRANCISCO, CA 94105 37.8 -122. Cannavine San Francisco, …
## 5 527 Howard ST San Francisco, CA 94105 37.8 -122. Fixed, 527 Howard St, Ste…
## 6 2414 Lombard ST San Francisco, CA 94123 37.8 -122. Mile 7.6 US Hwy 101 N, Sa…
Looking at Premise Address and address_found we can see that we did pretty well! Not perfect, but most are the right address or a few doors down. Well done!
Although there are a few ways to work with vector spatial data in R, we will focus on the sf package in this course. The majority of spatial folks in R have shifted to sf for vector data, and so it makes sense to focus on it in the class.
Processing spatial data is very similar to processing nonspatial data thanks to the package sf, which is tidy friendly. sf stands for simple features. The Simple Features standard defines a simple feature as a representation of a real-world object by a point or points that may or may not be connected by straight line segments to form lines or polygons. A feature is thought of as a thing, or an object in the real world, such as a building or a tree. A county can be a feature, as can a city or a neighborhood. Features have a geometry describing where on Earth they are located, and they have attributes, which describe their other properties.
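As a toy illustration of these ideas (a sketch, not part of the lab data), a single feature can be built by pairing a geometry with attributes:

```r
library(sf)

pt   <- st_point(c(-122.4, 37.8))          # a simple feature geometry (a point)
geo  <- st_sfc(pt, crs = 4326)             # a geometry column with a CRS attached
feat <- st_sf(name = "example building",   # attributes + geometry = a feature
              geometry = geo)
feat
```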
Now let’s get our hands dirty working with some spatial data.
For this lab, we will primarily be working with the MapGAM package. If you go to the link, you can read the reference manual on the various datasets available in the package. For this lab, we will mainly be working with the CAdata dataset. While they are based on real patterns expected in observational epidemiologic studies, these data have been simulated and are for teaching purposes only. The data contain 5000 simulated ovarian cancer cases. This is a cohort with time to mortality measured, but for the purposes of our class, we will conduct simple tabular analyses looking at associations between spatial exposures with mortality at end of follow-up.
The CAdata dataset contains the following variables: time (follow-up time), event (mortality at end of follow-up), X and Y (projected coordinates), AGE, and INS (insurance type).
So let’s bring in the CAdata dataset and have a look at it.
#Load CAdata dataset from MapGAM package
data(CAdata)
ca_pts <- CAdata
summary(ca_pts)
## time event X Y
## Min. : 0.004068 Min. :0.0000 Min. :1811375 Min. :-241999
## 1st Qu.: 1.931247 1st Qu.:0.0000 1st Qu.:2018363 1st Qu.: -94700
## Median : 4.749980 Median :1.0000 Median :2325084 Median : -60386
## Mean : 6.496130 Mean :0.6062 Mean :2230219 Mean : 87591
## 3rd Qu.: 9.609031 3rd Qu.:1.0000 3rd Qu.:2380230 3rd Qu.: 318280
## Max. :24.997764 Max. :1.0000 Max. :2705633 Max. : 770658
## AGE INS
## Min. :25.00 Mcd: 431
## 1st Qu.:53.00 Mcr:1419
## Median :62.00 Mng:2304
## Mean :61.28 Oth: 526
## 3rd Qu.:71.00 Uni: 168
## Max. :80.00 Unk: 152
OK, so the variables look great. Is it a spatial dataset that can be
recognized by R? Not just yet. We know that X is likely some
sort of longitude column and Y is likely some sort of latitude
column, although the values don’t exactly look like typical latitude and longitude. We have to tell R that the X and Y columns are spatial coordinates using the st_as_sf function in sf. With this command, we specify which columns hold the coordinates with the coords = c() argument.
ca_pts <- st_as_sf(CAdata, coords=c("X","Y"))
Let’s check the coordinate reference system (CRS) using the
st_crs command in the sf package.
st_crs(ca_pts)
## Coordinate Reference System: NA
Hmmm, NA. That still doesn’t look good. So how do we make this a spatial file? We will need to add a CRS.
Let’s add a CRS using st_set_crs from the sf package. We get the CRS for this dataset from the MapGAM documentation (don’t worry–it took me forever to find this, but usually a CRS is much easier to find). Then we will double-check the CRS.
#Load the projection into an object called ca_proj
ca_proj <- "+proj=lcc +lat_1=40 +lat_2=41.66666666666666
+lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
+y_0=500000.0000000002 +ellps=GRS80
+datum=NAD83 +units=m +no_defs"
#Set CRS
ca_pts_crs <- st_set_crs(ca_pts, ca_proj)
#Look at dataset
summary(ca_pts_crs)
## time event AGE INS
## Min. : 0.004068 Min. :0.0000 Min. :25.00 Mcd: 431
## 1st Qu.: 1.931247 1st Qu.:0.0000 1st Qu.:53.00 Mcr:1419
## Median : 4.749980 Median :1.0000 Median :62.00 Mng:2304
## Mean : 6.496130 Mean :0.6062 Mean :61.28 Oth: 526
## 3rd Qu.: 9.609031 3rd Qu.:1.0000 3rd Qu.:71.00 Uni: 168
## Max. :24.997764 Max. :1.0000 Max. :80.00 Unk: 152
## geometry
## POINT :5000
## epsg:NA : 0
## +proj=lcc ...: 0
##
##
##
#Check the CRS
st_crs(ca_pts_crs)
## Coordinate Reference System:
## User input: +proj=lcc +lat_1=40 +lat_2=41.66666666666666
## +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
## +y_0=500000.0000000002 +ellps=GRS80
## +datum=NAD83 +units=m +no_defs
## wkt:
## PROJCRS["unknown",
## BASEGEOGCRS["unknown",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]]],
## CONVERSION["unknown",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",39.3333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-122,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",40,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",41.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",2000000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
Nice! We have a spatial dataset. That geometry column is how sf stores the geographic data, and the coordinates there now look the way we would expect for this projection. And we definitely have a full CRS with all sorts of info. OK, let’s plot our data to make sure they look spatial! We will use the tmap package to plot the points. We will first set the tmap_mode to “view”, which is interactive. There’s also a “plot” option, which is nice for making exportable figures. We will then create an object called cancer_map and add a layer with tm_shape(). This allows us to combine several maps into one, or to add layers on top of each other. Then we specify how to draw that layer; here we will use tm_dots() to plot the points. In our options, we specify the size with size=.
tmap_mode("view")
## ℹ tmap modes "plot" - "view"
## ℹ toggle with `tmap::ttm()`
cancer_map = tm_shape(ca_pts_crs) + tm_dots(size=0.5)
cancer_map
## Registered S3 method overwritten by 'jsonify':
## method from
## print.json jsonlite
Let’s play around with some of the options. We can change the color
with the col= option. We can make the dots smaller by
specifying the size= option. And we can change the
transparency of the points with the alpha= option.
cancer_map_small = tm_shape(ca_pts_crs) + tm_dots(col = "blue", size = 0.3, alpha = 0.5)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_dots()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_dots()`: use `fill_alpha` instead of `alpha`.
## This message is displayed once every 8 hours.
cancer_map_small
Let’s map the points color coded by the variable event, i.e. whether or not the participant died over follow-up. We do that by specifying that the color (col=) is based on the column event. We have to specify that event is a categorical variable with style="cat".
cancer_map_events = tm_shape(ca_pts_crs) + tm_dots(size=0.3, col="event", style="cat")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_dots()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
cancer_map_events
Let’s see if we can change up the color scheme.
cancer_map_events_rg = tm_shape(ca_pts_crs) + tm_dots(col = "event", palette = c("0" = "gray", "1" = "red"), style="cat")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_dots()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values') to
## 'tm_scale_categorical(<HERE>)'
cancer_map_events_rg
## Multiple palettes called "gray" found: "matplotlib.gray", "tableau.gray", "ocean.gray", "gmt.gray". The first one, "matplotlib.gray", is returned.
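The migration messages above describe the tmap v4 equivalents of these options. Based on those messages, a v4-style version of this map might look like (a sketch):

```r
# tmap v4 syntax: 'fill' colors the dots, tm_scale_categorical() replaces
# style = "cat", and 'values' replaces 'palette'
cancer_map_events_v4 = tm_shape(ca_pts_crs) +
  tm_dots(fill = "event",
          fill.scale = tm_scale_categorical(values = c("gray", "red")),
          size = 0.3)
cancer_map_events_v4
```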
Looking good! You earned yourself a tmap badge. Now get yourself a cookie.
One of the primary sources of data that we’ll be using in this class is the United States Decennial Census and the American Community Survey. There are two ways to bring Census data into R: using an API or downloading it from an online source.
Whichever source you use, pay attention to vintages. Census boundaries changed in 2020, which means that 2016-2020 and later data will not completely merge with ACS data from before 2020. So make sure you merge 2020 data only with 2020 data (but you can merge 2019 data with data from 2010-2019). This is especially important for tract data: many new tracts were created in 2020, and existing tracts experienced dramatic boundary changes between 2010 and 2020. See the impact of tract boundary changes between 2000 and 2010 here. You may also explore the Neighborhood Change Database, which is available through the UC Davis library and incorporates tract boundary changes over time. We are working on acquiring the 2020 data there!
The first way to obtain Census data is to download them directly from the web onto your hard drive. There are several websites where you can download Census data including Social Explorer and PolicyMap, which we have free access to as UC Davis affiliates, and the National Historical Geographic Information System (NHGIS), which is free for everyone. To find out how to download data from PolicyMap and NHGIS, check out tutorials here and here.
The other way to bring Census data into R is to use the Census Application Program Interface (API). An API allows for direct requests for data in machine-readable form. That is, rather than you having to navigate to some website, scroll around to find a dataset, download that dataset once you find it, save that data onto your hard drive, and then bring the data into R, you just tell R to retrieve data directly from the source using one or two lines of code.
In order to directly download data from the Census API, you need a key. You can sign up for a free key here, which you should have already done before the lab. Type your key in quotes using the census_api_key() command.
census_api_key("YOUR API KEY GOES HERE", install = TRUE)
The option install = TRUE saves the API key in your R
environment, which means you don’t have to run
census_api_key() every single time. The function for
downloading American Community Survey (ACS) Census data is
get_acs(). The command for downloading decennial Census
data is get_decennial(). Both functions come from the
tidycensus package, which allows users to interface
with the US Census Bureau’s decennial Census and American Community
Survey APIs. Getting variables using the Census API requires knowing the
variable ID - and there are thousands of variables (and thus thousands
of IDs) across the different Census files. To rapidly search for
variables, use the commands load_variables() and
View(). Because we’ll be using the ACS in this guide, let’s
check the variables in the most recent 2023 5-year ACS (2019-2023) using
the following commands.
acs2023 <- load_variables(2023, "acs5", cache = TRUE)
View(acs2023)
A window should have popped up showing you a record layout of the 2019-2023 ACS. To search for specific data, select “Filter” located at the top left of this window and use the search boxes that pop up. For example, type in “Hispanic” in the box under “Label”. You should see near the top of the list the first set of variables we’ll want to download - race/ethnicity. Another way of finding variable names is to search them using Social Explorer. Click on the appropriate survey data year and then “American Community Survey Tables”, which will take you to a list of variables with their Census IDs.
Let’s extract race/ethnicity data and total population for California
counties using the get_acs() command.
ca <- get_acs(geography = "county",
year = 2023,
variables = c(tpopr = "B03002_001",
nhwhite = "B03002_003", nhblk = "B03002_004",
nhasn = "B03002_006", hisp = "B03002_012"),
state = "CA",
survey = "acs5",
output = "wide")
## Getting data from the 2019-2023 5-year ACS
In the above code, we specified the following arguments:

- geography: The level of geography we want the data in; in our case, the county. Other geographic options can be found here.
- year: The end year of the data (because we want 2019-2023, we use 2023).
- variables: The variables we want to bring in, specified in a vector you create using the function c(). Note that we created variable names of our own (e.g. “nhwhite”) and we put the ACS IDs in quotes (“B03002_003”). Had we not done this, the variables would come in named as they are in the ACS, which is not very descriptive.
- state: We can filter the counties to those in a specific state. Here it is “CA” for California. If we don’t specify this, we get all counties in the United States.
- survey: The specific Census survey we’re extracting data from. We want data from the 5-year American Community Survey, so we specify “acs5”. The ACS comes in 1- and 5-year varieties.
- output: This argument tells R to return a wide dataset as opposed to a long dataset (see this vignette for more info).

Another useful option to set is cache_table = TRUE, so you don’t have to re-download after you’ve downloaded successfully the first time. Type ?get_acs to see the full list of options.
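As an example of varying these arguments, here is a sketch pulling two of the same variables at the tract level for a single county (requires your API key to be set; the county choice is just an illustration):

```r
# Tract-level version of the same request, restricted to one county;
# cache_table = TRUE avoids re-downloading the variable table next time
sac_tracts <- get_acs(geography = "tract",
                      year = 2023,
                      variables = c(tpopr = "B03002_001",
                                    nhwhite = "B03002_003"),
                      state = "CA",
                      county = "Sacramento",
                      survey = "acs5",
                      output = "wide",
                      cache_table = TRUE)
```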
As you learned in Lab 1 and Lab 2, whenever you bring in a dataset, the
first thing you should always do is view it to get a sense of its
structure and to make sure you got what you expected. One way of doing
this is to use the glimpse() command.
glimpse(ca)
## Rows: 58
## Columns: 12
## $ GEOID <chr> "06001", "06003", "06005", "06007", "06009", "06011", "06013"…
## $ NAME <chr> "Alameda County, California", "Alpine County, California", "A…
## $ tpoprE <dbl> 1651949, 1695, 41029, 209470, 45995, 21895, 1161458, 27293, 1…
## $ tpoprM <dbl> NA, 234, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ nhwhiteE <dbl> 466445, 993, 30234, 139527, 35599, 6869, 455961, 16668, 14254…
## $ nhwhiteM <dbl> 1170, 215, 341, 767, 318, 133, 1843, 199, 589, 979, 75, 554, …
## $ nhblkE <dbl> 159042, 0, 781, 3550, 529, 311, 94864, 805, 1522, 42060, 158,…
## $ nhblkM <dbl> 1736, 14, 143, 393, 140, 44, 1594, 123, 232, 1334, 126, 271, …
## $ nhasnE <dbl> 528377, 8, 587, 11010, 1066, 101, 212373, 821, 9640, 108809, …
## $ nhasnM <dbl> 2269, 8, 129, 497, 238, 159, 2008, 264, 416, 1332, 182, 403, …
## $ hispE <dbl> 385245, 249, 6361, 40829, 6403, 13639, 316799, 5350, 27230, 5…
## $ hispM <dbl> NA, 115, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
You get a quick, compact summary of your tibble. You can also use the
head() command, which shows you the first several rows of
your data object (tail() will give you the last several
rows).
head(ca)
## # A tibble: 6 × 12
## GEOID NAME tpoprE tpoprM nhwhiteE nhwhiteM nhblkE nhblkM nhasnE nhasnM hispE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 06001 Alam… 1.65e6 NA 466445 1170 159042 1736 528377 2269 385245
## 2 06003 Alpi… 1.69e3 234 993 215 0 14 8 8 249
## 3 06005 Amad… 4.10e4 NA 30234 341 781 143 587 129 6361
## 4 06007 Butt… 2.09e5 NA 139527 767 3550 393 11010 497 40829
## 5 06009 Cala… 4.60e4 NA 35599 318 529 140 1066 238 6403
## 6 06011 Colu… 2.19e4 NA 6869 133 311 44 101 159 13639
## # ℹ 1 more variable: hispM <dbl>
The tibble contains counties with their estimates for race/ethnicity. These variables end with the letter “E”. It also contains the margins of error for each estimate. These variables end with the letter “M”.
tidycensus is a game changer for bringing Census data into R in a convenient, fast, efficient, and tidy-friendly way. We’ll be using this package in the next lab to bring in Census spatial data. And congratulations! You’ve just earned another badge. Fantastic!
To save us time, I’ve uploaded a PolicyMap file (link to .csv) on the Github for you to use in this lab. Save this file in the same folder where your Lab 3 R Markdown file resides. To read in a .csv file, first make sure that R is pointed to the folder where you saved your data. Type getwd() to find out the current directory and setwd("directory name") to set the directory to the folder containing the data.
From a Mac laptop, I type in the following command to set the directory to the folder containing my data.
setwd("/Users/pjames1/Dropbox/UC Davis Folders/SPH 215 GIS and Public Health/Github_Website/SPH215/")
For a Windows system, you can find the pathway of a file by right clicking on it and selecting Properties. You will find that instead of a forward slash like on a Mac, a Windows pathway is indicated by a single backslash \. R doesn’t like this because it treats a single backslash as an escape character. Use instead two backslashes \\
setwd("C:\\Users\\pjames\\Documents\\UCD\\SPH215\\Labs\\Lab 3")
or a forward slash /.
setwd("C:/Users/pjames/Documents/UCD/SPH215/Labs/Lab 3")
You can also manually set the working directory by clicking on Session -> Set Working Directory -> Choose Directory from the menu.
Once you’ve set your directory, use the function
read_csv(), which is a part of the
tidyverse package, and plug in the name of the file in
quotes inside the parentheses. Make sure you include the .csv
extension.
ca.pm <- read_csv("PolicyMap Data 2025-03-27 192555 UTC.csv", skip = 1)
## Rows: 58 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): GeoID_Description, GeoID_Name, SitsinState, GeoID, GeoID_Formatted,...
## dbl (2): mhhinc, GeoVintage
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The option skip = 1 tells R to skip the first row of the
file when bringing it in. This is done because there are two rows of
column names. The first row contains the extended version, while the
second is the abridged version. Above we keep the abridged version.
You should see a tibble ca.pm pop up in your Environment window (top right). What does our data set look like?
glimpse(ca.pm)
## Rows: 58
## Columns: 10
## $ GeoID_Description <chr> "County", "County", "County", "County", "County", "C…
## $ GeoID_Name <chr> "Alameda", "Alpine", "Amador", "Butte", "Calaveras",…
## $ SitsinState <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA"…
## $ GeoID <chr> "06001", "06003", "06005", "06007", "06009", "06011"…
## $ GeoID_Formatted <chr> "=\"06001\"", "=\"06003\"", "=\"06005\"", "=\"06007\…
## $ mhhinc <dbl> 126240, 110781, 81526, 68574, 79877, 75149, 125727, …
## $ TimeFrame <chr> "2019-2023", "2019-2023", "2019-2023", "2019-2023", …
## $ GeoVintage <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ Source <chr> "Census", "Census", "Census", "Census", "Census", "C…
## $ Location <chr> "California (State)", "California (State)", "Califor…
If you like viewing your data through an Excel-style worksheet, type View(ca.pm), and ca.pm should pop up in the top left window of your RStudio interface.
We learned about the various data wrangling related functions from the tidyverse package in Lab 1. Let’s employ some of those functions to create a single county level dataset that joins the datasets we downloaded from the Census API and PolicyMap.
We are going to combine these datasets using the county FIPS codes. In the Census API and PolicyMap, these are contained in the variables GEOID and GeoID, respectively. Let’s make sure they are in the same class.
class(ca.pm$GeoID)
## [1] "character"
class(ca$GEOID)
## [1] "character"
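Both are character, so we can join directly. Had GeoID come in numeric instead, a sketch of the fix would be coercing it before the join (a numeric FIPS also drops the leading zero, so we would pad it back; str_pad() comes from stringr, loaded with the tidyverse):

```r
# Hypothetical fix if GeoID had been numeric: coerce to character
# and restore the leading zero with a 5-character zero-pad
ca.pm <- ca.pm %>%
  mutate(GeoID = str_pad(as.character(GeoID), width = 5, pad = "0"))
```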
One of the important innovations from the tidyverse is the pipe
operator %>%. You use the pipe operator when you want to
combine multiple operations into one line of continuous code. Let’s
create our final data object cacounty using our brand new
friend the pipe. Let’s put together a lot of what we’ve done over the
past few weeks into one step! Now we are cooking with gas!
cacounty <- ca %>%
left_join(ca.pm, by = c("GEOID" = "GeoID")) %>%
mutate(pwhite = nhwhiteE/tpoprE, pasian = nhasnE/tpoprE,
pblack = nhblkE/tpoprE, phisp = hispE/tpoprE,
mhisp = case_when(phisp > 0.5 ~ "Majority",
TRUE ~ "Not Majority")) %>%
rename(County = GeoID_Name) %>%
select(GEOID, County, pwhite, pasian, pblack, phisp, mhisp, mhhinc)
glimpse(cacounty)
## Rows: 58
## Columns: 8
## $ GEOID <chr> "06001", "06003", "06005", "06007", "06009", "06011", "06013", …
## $ County <chr> "Alameda", "Alpine", "Amador", "Butte", "Calaveras", "Colusa", …
## $ pwhite <dbl> 0.28236041, 0.58584071, 0.73689342, 0.66609538, 0.77397543, 0.3…
## $ pasian <dbl> 0.319850673, 0.004719764, 0.014306954, 0.052561226, 0.023176432…
## $ pblack <dbl> 0.0962753693, 0.0000000000, 0.0190353165, 0.0169475343, 0.01150…
## $ phisp <dbl> 0.2332064, 0.1469027, 0.1550367, 0.1949157, 0.1392108, 0.622927…
## $ mhisp <chr> "Not Majority", "Not Majority", "Not Majority", "Not Majority",…
## $ mhhinc <dbl> 126240, 110781, 81526, 68574, 79877, 75149, 125727, 66780, 1061…
Let’s break down what the pipe is doing here. First, you start out
with your dataset ca. You “pipe” that into the command
left_join(). Notice that you didn’t have to type in ca
inside that command - %>% pipes that in for you. The
command joins the data object ca.pm to ca. The result
of this function gets piped into the mutate() function,
which creates the percent race/ethnicity (from the Census API), and
majority Hispanic variables. This gets piped into the
rename() function, which renames the ambiguous variable
name GeoID_Name to the more descriptive name County.
This then gets piped into the final function, select(),
which keeps the necessary variables. Finally, the code saves the result
into cacounty which we designated at the beginning with the arrow
operator.
Piping makes code clearer, and simultaneously gets rid of the need to define any intermediate objects that you would have needed to keep track of while reading the code. PIPE, Pipe, and pipe whenever you can. We need some stinkin badges!
If you want to save your data frame or tibble as a csv file on your
hard drive, use the command write_csv(). Before you save a
file, make sure R is pointed to the appropriate folder on your hard
drive by using the function getwd(). If it’s not pointed to
the right folder, use the function setwd() to set the
appropriate working directory.
write_csv(cacounty, "lab2_file.csv")
The first argument is the name of the R object you want to save. The second argument is the name of the csv file in quotes. Make sure to add the .csv extension. The file is saved in your current working directory.
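As a quick optional sanity check, you can read the file back in and confirm the columns survived the round trip:

```r
# Read the saved csv back in; the columns should match cacounty
check <- read_csv("lab2_file.csv")
glimpse(check)
```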
The functions above help us bring in and clean data. The next set of functions covered in this section will help us summarize the data. Data refer to pieces of information that describe a status or a measure of magnitude. A variable is a set of observations on a particular characteristic. The distribution of a variable is a listing showing all the possible values of the data for that variable and how often they occur. Exploratory Data Analysis (EDA) encompasses a set of methods (some would say a framework or perspective) for summarizing a variable’s distribution, and the relationship between the distributions of two or more variables. We will cover two general approaches to summarizing your data: descriptive statistics and visualization via graphs and charts.
When describing a distribution, your quantitative message is often best communicated by reducing data to a few summary numbers. These numbers are meant to summarize the “typical” value in the distribution (e.g., mean, median, mode) and the variation or “spread” in the distribution (e.g., minimum/maximum, interquartile range, standard deviation). These summary numbers are known as descriptive statistics.
We can use the function summarize() to get descriptive statistics of our data. For example, let’s calculate the mean household income in California counties. The first argument inside summarize() is the data object cacounty and the second argument is the function calculating the specific summary statistic, in this case mean().
cacounty %>%
summarize(Mean = mean(mhhinc))
## # A tibble: 1 × 1
## Mean
## <dbl>
## 1 87001.
The average county median household income is $87,001. If the
variable mhhinc contained missing values, we would have gotten
NA as a result. To omit missing values from the calculation,
you need to add na.rm = TRUE to mean().
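Here is a small illustration of that behavior using a hypothetical vector containing a missing value:

```r
# A hypothetical income vector with one missing value
x <- c(55000, 72000, NA, 91000)

mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # drops the NA first; returns 72666.67
```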
We can calculate more than one summary statistic within
summarize(). What is the spread of the distribution? We can
add to summarize() the function sd() to
calculate the standard deviation.
cacounty %>%
summarize(Mean = mean(mhhinc), SD = sd(mhhinc))
## # A tibble: 1 × 2
## Mean SD
## <dbl> <dbl>
## 1 87001. 25547.
Does the average income differ by California region? First, let’s
create a new variable region designating each county as Bay Area,
Southern California, Central Valley, Capital Region and the Rest of
California using the case_when() function within the
mutate() function.
cacounty <- cacounty %>%
mutate(region = case_when(County == "Sonoma" | County == "Napa" |
County == "Solano" | County == "Marin" |
County == "Contra Costa" | County == "San Francisco" |
County == "San Mateo" | County == "Alameda" |
County == "Santa Clara" ~ "Bay Area",
County == "Imperial" | County == "Los Angeles" |
County == "Orange" | County == "Riverside" |
County == "San Diego" | County == "San Bernardino" |
County == "Ventura" ~ "Southern California",
County == "Fresno" | County == "Madera" |
County == "Mariposa" | County == "Merced" |
County == "Tulare" |
County == "Kings" ~ "Central Valley",
County == "Alpine" | County == "Colusa" |
County == "El Dorado" | County == "Glenn" |
County == "Placer" | County == "Sacramento" |
County == "Sutter" | County == "Yolo" |
County == "Yuba" ~ "Capital Region",
TRUE ~ "Rest"))
Next, we need to pair summarize() with the function
group_by(). The function group_by() tells R to
run subsequent functions on the data object by a group characteristic
(such as gender, educational attainment, or in this case, region). We’ll
need to use our new best friend %>% to accomplish this task.
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc))
## # A tibble: 5 × 2
## region Mean
## <chr> <dbl>
## 1 Bay Area 129297.
## 2 Capital Region 89288.
## 3 Central Valley 69265.
## 4 Rest 74959.
## 5 Southern California 91332.
The first pipe sends cacounty into the function
group_by(), which tells R to group cacounty by the
variable region.
How do you know the tibble is grouped? Because it tells you!
cacounty %>%
group_by(region)
## # A tibble: 58 × 9
## # Groups: region [5]
## GEOID County pwhite pasian pblack phisp mhisp mhhinc region
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 06001 Alameda 0.282 0.320 0.0963 0.233 Not Majority 126240 Bay Area
## 2 06003 Alpine 0.586 0.00472 0 0.147 Not Majority 110781 Capital …
## 3 06005 Amador 0.737 0.0143 0.0190 0.155 Not Majority 81526 Rest
## 4 06007 Butte 0.666 0.0526 0.0169 0.195 Not Majority 68574 Rest
## 5 06009 Calaveras 0.774 0.0232 0.0115 0.139 Not Majority 79877 Rest
## 6 06011 Colusa 0.314 0.00461 0.0142 0.623 Majority 75149 Capital …
## 7 06013 Contra Costa 0.393 0.183 0.0817 0.273 Not Majority 125727 Bay Area
## 8 06015 Del Norte 0.611 0.0301 0.0295 0.196 Not Majority 66780 Rest
## 9 06017 El Dorado 0.741 0.0501 0.00791 0.142 Not Majority 106190 Capital …
## 10 06019 Fresno 0.270 0.108 0.0416 0.541 Majority 71434 Central …
## # ℹ 48 more rows
The second pipe takes this grouped dataset and sends it into the
summarize() command, which calculates the mean income (by
region, because the dataset is grouped by region).
To get the mean, median and standard deviation of median income, its correlation with percent Hispanic, and give column labels for the variables in the resulting summary table, we type in:
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc),
Median = median(mhhinc),
SD = sd(mhhinc),
Correlation = cor(mhhinc, phisp))
## # A tibble: 5 × 5
## region Mean Median SD Correlation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bay Area 129297. 126240 22246. -0.620
## 2 Capital Region 89288. 88724 17295. -0.788
## 3 Central Valley 69265. 69120. 3918. 0.423
## 4 Rest 74959. 71931 15803. 0.606
## 5 Southern California 91332. 89672 19132. -0.951
The variable mhhinc is numeric. How do we summarize
categorical variables? We usually summarize categorical variables by
examining a frequency table. To get the percent of counties that have a
majority Hispanic population mhisp, you’ll need to combine the functions
group_by(), summarize() and mutate() using
%>%.
cacounty %>%
group_by(mhisp) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 2 × 3
## mhisp n freq
## <chr> <int> <dbl>
## 1 Majority 12 0.207
## 2 Not Majority 46 0.793
The code group_by(mhisp) separates the counties by the
categories of mhisp (Majority, Not Majority). We then used
summarize() to count the number of counties that are
Majority and Not Majority. The function to get a count is
n(), and we saved this count in a variable named
n. Next, we used mutate() on this table to get the
proportion of counties by Majority Hispanic designation. The code
sum(n) adds the values of n. We then divide the
value of each n by this sum. That yields the final frequency table.
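As an aside, dplyr's count() function collapses the group_by() and summarize(n = n()) steps into a single call; the sketch below should produce the same frequency table:

```r
# count(mhisp) is shorthand for group_by(mhisp) %>% summarize(n = n())
cacounty %>%
  count(mhisp) %>%
  mutate(freq = n / sum(n))
```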
Instead of calculating descriptive statistics one at a time using
summarize(), you can obtain a set of summary statistics for
one or all the numeric variables in your dataset using the
summary() function.
summary(cacounty)
## GEOID County pwhite pasian
## Length:58 Length:58 Min. :0.09421 Min. :0.00000
## Class :character Class :character 1st Qu.:0.31535 1st Qu.:0.01888
## Mode :character Mode :character Median :0.49018 Median :0.04404
## Mean :0.50723 Mean :0.07748
## 3rd Qu.:0.66599 3rd Qu.:0.08630
## Max. :0.88832 Max. :0.39306
## pblack phisp mhisp mhhinc
## Min. :0.000000 Min. :0.05892 Length:58 Min. : 53498
## 1st Qu.:0.009634 1st Qu.:0.15611 Class :character 1st Qu.: 67888
## Median :0.016844 Median :0.27110 Mode :character Median : 80702
## Mean :0.028112 Mean :0.32170 Mean : 87001
## 3rd Qu.:0.033734 3rd Qu.:0.46747 3rd Qu.:102701
## Max. :0.125967 Max. :0.85579 Max. :159674
## region
## Length:58
## Class :character
## Mode :character
##
##
##
The output from the descriptive statistics we've run so far is not presentation ready. For example, taking a screenshot of the following results table captures unnecessary information that is confusing and messy.
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc),
Median = median(mhhinc),
SD = sd(mhhinc),
Correlation = cor(mhhinc, phisp))
## # A tibble: 5 × 5
## region Mean Median SD Correlation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bay Area 129297. 126240 22246. -0.620
## 2 Capital Region 89288. 88724 17295. -0.788
## 3 Central Valley 69265. 69120. 3918. 0.423
## 4 Rest 74959. 71931 15803. 0.606
## 5 Southern California 91332. 89672 19132. -0.951
Furthermore, you may want to show a table, say in a manuscript, that does not require taking a screenshot or copying and pasting into Excel, but can instead be produced via code. That way it can be fixed if there is an issue, and it is reproducible.
One way of producing presentation-ready tables in R is through the flextable package. First, you will need to save the tibble or data frame of results into an object. For example, let's save the above results into an object named region.summary.
region.summary <- cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc),
Median = median(mhhinc),
SD = sd(mhhinc),
Correlation = cor(mhhinc, phisp))
You then input the object into the function flextable().
Save the result into an object called my_table.
my_table <- flextable(region.summary)
my_table
| region | Mean | Median | SD | Correlation |
|---|---|---|---|---|
| Bay Area | 129,297.33 | 126,240.0 | 22,246.495 | -0.6196837 |
| Capital Region | 89,287.78 | 88,724.0 | 17,294.860 | -0.7878129 |
| Central Valley | 69,265.17 | 69,119.5 | 3,918.442 | 0.4231022 |
| Rest | 74,958.74 | 71,931.0 | 15,803.202 | 0.6064605 |
| Southern California | 91,331.86 | 89,672.0 | 19,131.821 | -0.9513650 |
You should see a relatively clean table pop up either in your console or Viewer window.
What kind of object is my_table?
class(my_table)
## [1] "flextable"
After doing this, we can progressively pipe the my_table
object through more flextable formatting functions. For
example, you can change the column header names using the function
set_header_labels() and center the header names using the
function align().
my_table <- my_table %>%
set_header_labels(
region = "Region",
Mean = "Mean",
Median = "Median",
SD = "Standard Deviation",
Correlation = "Correlation") %>%
flextable::align(align = "center", part = "all")
my_table
| Region | Mean | Median | Standard Deviation | Correlation |
|---|---|---|---|---|
| Bay Area | 129,297.33 | 126,240.0 | 22,246.495 | -0.6196837 |
| Capital Region | 89,287.78 | 88,724.0 | 17,294.860 | -0.7878129 |
| Central Valley | 69,265.17 | 69,119.5 | 3,918.442 | 0.4231022 |
| Rest | 74,958.74 | 71,931.0 | 15,803.202 | 0.6064605 |
| Southern California | 91,331.86 | 89,672.0 | 19,131.821 | -0.9513650 |
Well doesn't that look spiffy! There are a slew of options for formatting your table, including adding footnotes, borders, shading, and other features. Check out this useful tutorial for an explanation of some of these features.
Once you're done formatting your table, you can export it to
Word, PowerPoint, or HTML, or save it as an image (PNG) file. To do this, use one
of the following functions: save_as_docx(),
save_as_pptx(), save_as_image(), and
save_as_html().
Use the save_as_image() function to save your table as
an image.
save_as_image(my_table, path = "reg_income.png")
## [1] "reg_income.png"
You first put in the table my_table, and set the file name with the .png extension. Check your working directory. You should see the file reg_income.png.
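The other export functions work the same way; for example, saving the table into a Word document (hypothetical file name) would look like:

```r
# Export the flextable to a .docx file in the working directory
save_as_docx(my_table, path = "reg_income.docx")
```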
Another way of summarizing variables and their relationships is
through graphs and charts. The main package for R graphing is
ggplot2, which is part of the tidyverse package. The
graphing function is ggplot() and it takes on the basic
template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(x, y)) +
<OPTIONS>()
ggplot() is the base function where you specify your dataset using the data = argument. <GEOM_FUNCTION>() is the geometric object (or "geom") you want to plot, and its mapping = aes() argument ties variables in your data to the x- and y-axes. <OPTIONS>() are a set of functions you can specify to change the look of the graph, for example relabeling the axes or adding a title. The basic idea is that a ggplot graphic layers geometric objects (circles, lines, etc.), themes, and scales on top of data.
You first start out with the base layer. It represents the empty
ggplot layer defined by the ggplot() function.
cacounty %>%
ggplot()

We get an empty plot. We haven’t told ggplot() what type
of geometric object(s) we want to plot, nor how the variables should be
mapped to the geometric objects, so we just have a blank plot. We have
geoms to paint the blank canvas.
From here, we add a “geom” layer to the ggplot object.
Layers are added to ggplot objects using +, instead of
%>%, since you are not explicitly piping an object into
each subsequent layer, but adding layers on top of one another. Each
geom is associated with a specific type of graph.
Let’s go through some of the more common and popular graphs for visualizing your data.
A typical visual for summarizing a single numeric variable is a
histogram. To create a histogram, use geom_histogram() for
<GEOM_FUNCTION()>. Let’s create a histogram of median household
income. Note that we don’t need to specify the y= here
because we are plotting only one variable. We pipe in the object
cacounty into ggplot() to establish the base
layer. We then use geom_histogram() to add the data layer
on top of the base.
cacounty %>%
ggplot() +
geom_histogram(mapping = aes(x=mhhinc))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can continue to add layers to the plot. For example, we use the
argument xlab("Median household income") to label the
x-axis as “Median household income”.
cacounty %>%
ggplot() +
geom_histogram(mapping = aes(x=mhhinc)) +
xlab("Median household income")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note the message produced with the plot. It tells us that we can use
the bins = argument to change the number of bins used to
produce the histogram. You can increase the number of bins to make the
bins narrower and thus get a finer grain of detail. Or you can decrease
the number of bins to get a broader visual summary of the shape of the
variable’s distribution. Compare bins = 10 to bins = 50.
cacounty %>%
ggplot() +
geom_histogram(mapping = aes(x=mhhinc), bins=10) +
xlab("Median household income")

We can use a boxplot to visually summarize the distribution
of a single numeric variable or the relationship between a categorical
and numeric variable. Use geom_boxplot() for
<GEOM_FUNCTION()> to create a boxplot. Let’s examine median
household income. Note that a boxplot uses y= rather than x= to specify
where mhhinc goes. We also provide a descriptive y-axis label using the
ylab() function.
cacounty %>%
ggplot() +
geom_boxplot(mapping = aes(y = mhhinc)) +
ylab("Median household income")

Let’s examine the distribution of median income by mhisp.
Because we are examining the association between two variables, we need
to specify x and y variables in aes() (we also specify both
x- and y-axis labels).
cacounty %>%
ggplot() +
geom_boxplot(mapping = aes(x = mhisp, y = mhhinc)) +
xlab("Majority Hispanic") +
ylab("Median household income")

The top and bottom of a boxplot represent the 75th and 25th percentiles, respectively. The line in the middle of the box is the 50th percentile. Points outside the whiskers represent outliers. Outliers are defined as having values that are either larger than the 75th percentile plus 1.5 times the IQR or smaller than the 25th percentile minus 1.5 times the IQR.
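You can compute those outlier fences yourself with base R's quantile() and IQR() functions. This is a sketch, assuming mhhinc has no missing values:

```r
# 25th and 75th percentiles of median household income
q1 <- quantile(cacounty$mhhinc, 0.25)
q3 <- quantile(cacounty$mhhinc, 0.75)

# Whisker cutoffs: 1.5 times the interquartile range beyond each quartile
lower_fence <- q1 - 1.5 * IQR(cacounty$mhhinc)
upper_fence <- q3 + 1.5 * IQR(cacounty$mhhinc)

# Counties outside the fences are the points drawn beyond the whiskers
cacounty %>%
  filter(mhhinc < lower_fence | mhhinc > upper_fence)
```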
The boxplot is for all counties combined. Use the
facet_wrap() function to separate by region. Notice the
tilde ~ before the variable region inside facet_wrap().
cacounty %>%
ggplot() +
geom_boxplot(mapping = aes(x = mhisp, y = mhhinc)) +
xlab("Majority Hispanic") +
ylab("Median household income") +
facet_wrap(~region)

The primary purpose of a bar chart is to illustrate and compare the
values for a categorical variable. Bar charts show either the number or
frequency of each category. To create a bar chart, use
geom_bar() for <GEOM_FUNCTION()>.
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc)) %>%
ggplot(aes(x=region, y = Mean)) +
geom_bar(stat = "identity") +
xlab("Region") +
ylab("Average household income")

Right now the bars are ordered based on the region names. We can
order the bars in descending order based on household income by using
the reorder() function. Notice the negative sign in front
of Mean to order by descending order.
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc)) %>%
ggplot(aes(x=reorder(region, -Mean), y = Mean)) +
geom_bar(stat = "identity") +
xlab("Region") +
ylab("Average household income")

We can flip the axes using the function
coord_flip().
cacounty %>%
group_by(region) %>%
summarize(Mean = mean(mhhinc)) %>%
ggplot(aes(x=reorder(region, -Mean), y = Mean)) +
geom_bar(stat = "identity") +
xlab("Region") +
ylab("Average household income") +
coord_flip()

Well aren’t we fancy? ggplot() is a powerful function,
and you can make a lot of visually captivating graphs. We have just
scratched the surface of its functions and features. You can also make
your graphs really “pretty” and professional looking by altering
graphing features, including colors, labels, titles and axes. For a list
of ggplot() functions that alter various features of a
graph, check out Chapter 28 in RDS.
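As a taste of those options, here is the bar chart again with a title and a cleaner built-in theme; labs() and theme_minimal() are standard ggplot2 functions, though the specific styling here is just one possibility:

```r
cacounty %>%
  group_by(region) %>%
  summarize(Mean = mean(mhhinc)) %>%
  ggplot(aes(x = reorder(region, -Mean), y = Mean)) +
  geom_bar(stat = "identity") +
  # labs() sets the title and axis labels in one call
  labs(title = "Average household income by California region",
       x = "Region", y = "Average household income") +
  theme_minimal()
```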
Here’s your ggplot2 badge. Wear it with pride! OK, we’ve done a lot today. Get outside and get some fresh air!
In this lab, we are going to work more with vector data. We will learn how to visualize Census data, we will talk about how to wrangle spatial data, and we will get into some really cool ways to create choropleth maps (maps color coded by attributes). Finally, we will introduce the concept of rasters.
The objectives of this guide are to teach you:
Let’s get cracking!
First, let's load our packages.
library(sf)
library(MapGAM)
library(tidyverse)
library(tidycensus)
library(flextable)
library(RColorBrewer)
library(tmap)
library(terra)
## terra 1.8.29
##
## Attaching package: 'terra'
## The following objects are masked from 'package:flextable':
##
## align, colorize, rotate, width
## The following object is masked from 'package:tidyr':
##
## extract
In Lab 2, we worked with the tidycensus
package and the Census API to bring in Census data into R. We can use
the same commands to bring in Census geography. If you haven’t already,
make sure to sign
up for and install your Census API key. If you haven't yet
installed your API key, use census_api_key() to
install and activate it with the following code:
census_api_key("YOUR API KEY GOES HERE", install = TRUE)
Use the get_acs() command to bring in California
tract-level race/ethnicity counts, total population, and total number of
households. How did I find the variable IDs? Check Lab 2. Since we want tracts, we’ll use the
geography = "tract" argument.
ca.tracts <- get_acs(geography = "tract",
year = 2023,
variables = c(tpopr = "B03002_001",
nhwhite = "B03002_003", nhblk = "B03002_004",
nhasn = "B03002_006", hisp = "B03002_012"),
state = "CA",
output = "wide",
survey = "acs5",
geometry = TRUE,
cb = FALSE)
The only difference between the code above and what we used in Lab 2 is that we added one argument
to the get_acs() command: geometry = TRUE.
This tells R to bring in the spatial features associated with the
geography you specified in the command, in the above case California
tracts. You can set cache_table = TRUE so that you don’t
have to re-download after you’ve downloaded successfully the first time.
This is important because you might be downloading a really large file,
or may encounter Census FTP issues when trying to collect data.
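For example, a cached version of the call above (shortened to one variable for illustration) might look like:

```r
# cache_table = TRUE caches the table locally so later knits
# don't have to re-download it from the Census servers
ca.tracts.cached <- get_acs(geography = "tract",
                            year = 2023,
                            variables = c(tpopr = "B03002_001"),
                            state = "CA",
                            survey = "acs5",
                            geometry = TRUE,
                            cache_table = TRUE)
```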
Note: We can also download the data another way. We can go to the Census
Shapefiles website and navigate to 2023, Census Tracts, then
California. We can then download a .zip file that contains an ESRI
shapefile of the Census tracts for California. When we unzip the file,
we see a series of files. Thankfully, the sf package
has an st_read() function that can tackle this! For more
detailed data downloads, you can use National Historical Geographic Information
System (NHGIS). The code below is an example of how we might bring in a
shapefile, just for future reference!
ca.tracts <- st_read("/Users/pjames1/Downloads/tl_2024_06_tract/tl_2024_06_tract.shp")
OK, let's go back to the data we got from tidycensus. Let's take a look at our data.
ca.tracts
## Simple feature collection with 9129 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.482 ymin: 32.52951 xmax: -114.1312 ymax: 42.0095
## Geodetic CRS: NAD83
## First 10 features:
## GEOID NAME tpoprE
## 1 06001442700 Census Tract 4427; Alameda County; California 2998
## 2 06001442800 Census Tract 4428; Alameda County; California 3087
## 3 06037204920 Census Tract 2049.20; Los Angeles County; California 2459
## 4 06037205110 Census Tract 2051.10; Los Angeles County; California 3690
## 5 06037320101 Census Tract 3201.01; Los Angeles County; California 3402
## 6 06037205120 Census Tract 2051.20; Los Angeles County; California 3409
## 7 06037206010 Census Tract 2060.10; Los Angeles County; California 3490
## 8 06037206020 Census Tract 2060.20; Los Angeles County; California 8179
## 9 06037206050 Census Tract 2060.50; Los Angeles County; California 2290
## 10 06037111402 Census Tract 1114.02; Los Angeles County; California 5998
## tpoprM nhwhiteE nhwhiteM nhblkE nhblkM nhasnE nhasnM hispE hispM
## 1 475 921 168 119 115 1577 469 308 81
## 2 386 779 183 12 49 1671 356 561 181
## 3 387 68 33 49 51 0 14 2331 387
## 4 446 14 14 8 12 0 14 3642 446
## 5 438 236 131 65 93 123 164 2978 481
## 6 624 15 24 8 10 70 104 3303 628
## 7 605 500 264 17 75 1030 235 1919 519
## 8 350 1379 165 2633 344 238 76 3541 285
## 9 509 138 49 18 22 151 82 1932 519
## 10 1024 2685 831 18 19 920 563 2322 433
## geometry
## 1 MULTIPOLYGON (((-122.0172 3...
## 2 MULTIPOLYGON (((-122.0023 3...
## 3 MULTIPOLYGON (((-118.2028 3...
## 4 MULTIPOLYGON (((-118.2196 3...
## 5 MULTIPOLYGON (((-118.4388 3...
## 6 MULTIPOLYGON (((-118.2202 3...
## 7 MULTIPOLYGON (((-118.2392 3...
## 8 MULTIPOLYGON (((-118.2379 3...
## 9 MULTIPOLYGON (((-118.2287 3...
## 10 MULTIPOLYGON (((-118.5023 3...
The object looks much like a basic tibble, but with a few differences.

- Geometry Type indicates that the spatial data are in MULTIPOLYGON form (as opposed to points or lines, the other basic vector data forms).
- Bounding box indicates the spatial extent of the features (from left to right, for example, California tracts go from a longitude of -124.482 to -114.1312).
- Geodetic CRS tells us the coordinate reference system.

At its most basic, an sf object is a collection of simple features that includes attributes and geometries in the form of a data frame. In other words, it is a data frame (or tibble) with rows of features, columns of attributes, and a special column always named geometry that contains the spatial aspects of the features.
If you want to peek behind the curtain and learn more about the nitty gritty details about simple features, check out the official sf vignette.
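A quick check shows that an sf object really is a data frame underneath; st_drop_geometry() (from the sf package) strips the geometry column and returns an ordinary attribute table:

```r
class(ca.tracts)
# the class vector includes both "sf" and "data.frame"

# Remove the geometry column to get a plain attribute table back
ca.tracts.df <- st_drop_geometry(ca.tracts)
class(ca.tracts.df)
```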
There is a lot of stuff behind
the curtain of how R handles spatial data as simple features, but
the main takeaway is that sf objects are data frames.
This means you can use many of the tidyverse functions
we’ve learned in the past couple labs to manipulate sf
objects, including the pipe %>% operator. For example,
let’s break up the column NAME into separate tract, county and
state variables using the separate() function. We do all of
this in one continuous chunk of code using the pipe
operator %>%.
ca.tracts <- ca.tracts %>%
separate(NAME, c("Tract", "County", "State"), sep = "; ")
glimpse(ca.tracts)
## Rows: 9,129
## Columns: 15
## $ GEOID <chr> "06001442700", "06001442800", "06037204920", "06037205110", "…
## $ Tract <chr> "Census Tract 4427", "Census Tract 4428", "Census Tract 2049.…
## $ County <chr> "Alameda County", "Alameda County", "Los Angeles County", "Lo…
## $ State <chr> "California", "California", "California", "California", "Cali…
## $ tpoprE <dbl> 2998, 3087, 2459, 3690, 3402, 3409, 3490, 8179, 2290, 5998, 4…
## $ tpoprM <dbl> 475, 386, 387, 446, 438, 624, 605, 350, 509, 1024, 578, 621, …
## $ nhwhiteE <dbl> 921, 779, 68, 14, 236, 15, 500, 1379, 138, 2685, 2306, 11, 81…
## $ nhwhiteM <dbl> 168, 183, 33, 14, 131, 24, 264, 165, 49, 831, 535, 27, 73, 18…
## $ nhblkE <dbl> 119, 12, 49, 8, 65, 8, 17, 2633, 18, 18, 93, 26, 2, 352, 1086…
## $ nhblkM <dbl> 115, 49, 51, 12, 93, 10, 75, 344, 22, 19, 97, 41, 4, 264, 679…
## $ nhasnE <dbl> 1577, 1671, 0, 0, 123, 70, 1030, 238, 151, 920, 665, 13, 29, …
## $ nhasnM <dbl> 469, 356, 14, 14, 164, 104, 235, 76, 82, 563, 176, 21, 45, 26…
## $ hispE <dbl> 308, 561, 2331, 3642, 2978, 3303, 1919, 3541, 1932, 2322, 134…
## $ hispM <dbl> 81, 181, 387, 446, 481, 628, 519, 285, 519, 433, 265, 618, 18…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-122.0172 3..., MULTIPOLYGON (((…
Another important data wrangling operation is to join attribute data to an sf object. For example, let’s say you wanted to add tract-level median household income. We can pull this variable from the ACS with get_acs():
ca.inc <- get_acs(geography = "tract",
year = 2023,
variables = c(medinc = "B19013_001"),
state = "CA",
survey = "acs5",
output = "wide")
## Getting data from the 2019-2023 5-year ACS
Unlike before, we brought these data in without the
geometry = TRUE option. So this is just a table. But
remember, an sf object is a data frame, so we can use
left_join(), which we covered in Lab 1, to join the files ca.inc and
ca.tracts.
ca.tracts <- ca.tracts %>%
left_join(ca.inc, by = "GEOID")
#take a look to make sure the join worked
glimpse(ca.tracts)
## Rows: 9,129
## Columns: 18
## $ GEOID <chr> "06001442700", "06001442800", "06037204920", "06037205110", "…
## $ Tract <chr> "Census Tract 4427", "Census Tract 4428", "Census Tract 2049.…
## $ County <chr> "Alameda County", "Alameda County", "Los Angeles County", "Lo…
## $ State <chr> "California", "California", "California", "California", "Cali…
## $ tpoprE <dbl> 2998, 3087, 2459, 3690, 3402, 3409, 3490, 8179, 2290, 5998, 4…
## $ tpoprM <dbl> 475, 386, 387, 446, 438, 624, 605, 350, 509, 1024, 578, 621, …
## $ nhwhiteE <dbl> 921, 779, 68, 14, 236, 15, 500, 1379, 138, 2685, 2306, 11, 81…
## $ nhwhiteM <dbl> 168, 183, 33, 14, 131, 24, 264, 165, 49, 831, 535, 27, 73, 18…
## $ nhblkE <dbl> 119, 12, 49, 8, 65, 8, 17, 2633, 18, 18, 93, 26, 2, 352, 1086…
## $ nhblkM <dbl> 115, 49, 51, 12, 93, 10, 75, 344, 22, 19, 97, 41, 4, 264, 679…
## $ nhasnE <dbl> 1577, 1671, 0, 0, 123, 70, 1030, 238, 151, 920, 665, 13, 29, …
## $ nhasnM <dbl> 469, 356, 14, 14, 164, 104, 235, 76, 82, 563, 176, 21, 45, 26…
## $ hispE <dbl> 308, 561, 2331, 3642, 2978, 3303, 1919, 3541, 1932, 2322, 134…
## $ hispM <dbl> 81, 181, 387, 446, 481, 628, 519, 285, 519, 433, 265, 618, 18…
## $ NAME <chr> "Census Tract 4427; Alameda County; California", "Census Trac…
## $ medincE <dbl> 199154, 180800, 70500, 52262, 110967, 28516, 60703, 148661, 4…
## $ medincM <dbl> 30525, 28293, 15698, 5659, 22149, 3246, 28578, 39223, 19295, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-122.0172 3..., MULTIPOLYGON (((…
Note that we can’t use left_join() to join the attribute
tables of two sf files. You will need to either make
one of them not spatial by using the st_drop_geometry()
function or use the st_join() function to spatially join
them.
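As a minimal sketch of the first approach (tracts1 and tracts2 here are hypothetical sf objects, not datasets from this lab):

```r
# left_join() errors on two sf objects, so drop the geometry from
# the right-hand side to make it a plain data frame first
tracts2.df <- st_drop_geometry(tracts2)
tracts1 <- tracts1 %>%
  left_join(tracts2.df, by = "GEOID")
```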
We use the function tm_shape() from the
tmap package to map the data.
tmap_mode("plot")
## ℹ tmap modes "plot" - "view"
tract_map <- tm_shape(ca.tracts) + tm_polygons()
tract_map

There is Data Wrangling and then there is Spatial Data Wrangling. Cue
the dangerous-sounding music. Well, it’s not that dangerous or scary.
Spatial Data Wrangling involves cleaning or altering your data set based
on the geographic location of features. The sf package
offers a suite of functions unique to wrangling spatial data. Most of
these functions start out with the prefix st_. To see all
of the functions, type in
methods(class = "sf")
We won’t go through all of these functions as the list is quite
extensive. But, we’ll go through a few relevant spatial operations for
this class below. The function we will be primarily using is
st_join().
A common spatial data wrangling issue is to subset a set of spatial
objects based on their location relative to another spatial object. In
our case, we want to keep California tracts that are in the Sacramento
metro area. We can do this using the st_join() function.
We’ll need to specify a type of join. Let’s first try
join = st_intersects. First, let’s bring in a polygon of
the Sacramento metro area from Github.
url <- "https://github.com/pjames-ucdavis/SPH215/raw/main/sac.metro.rds"
download.file(url, destfile = "sac.metro.rds", mode = "wb")
sac.metro <- readRDS("sac.metro.rds")
Let’s take a look at sac.metro and understand what file it is.
glimpse(sac.metro)
## Rows: 1
## Columns: 14
## $ CSAFP <chr> "472"
## $ CBSAFP <chr> "40900"
## $ GEOID <chr> "40900"
## $ GEOIDFQ <chr> "310M700US40900"
## $ NAME <chr> "Sacramento-Roseville-Folsom, CA"
## $ NAMELSAD <chr> "Sacramento-Roseville-Folsom, CA Metro Area"
## $ LSAD <chr> "M1"
## $ MEMI <chr> "1"
## $ MTFCC <chr> "G3110"
## $ ALAND <dbl> 13191309279
## $ AWATER <dbl> 548018355
## $ INTPTLAT <chr> "+38.7902715"
## $ INTPTLON <chr> "-121.0056427"
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-120.0064 3...
OK, that geometry looks good. And it’s a polygon, so that’s good. Let’s now try to intersect our sac.metro dataset with our ca.tracts dataset.
sac.metro.tracts.int <- st_join(ca.tracts, sac.metro,
join = st_intersects, left=FALSE)
The above code tells R to identify the polygons in ca.tracts
that intersect with the polygon sac.metro. We indicate we want
a polygon intersection by specifying join = st_intersects.
The option left=FALSE tells R to remove the polygons from
ca.tracts that do not intersect (make it TRUE and see what
happens) with sac.metro. Plotting our tracts, we get:
tm_shape(sac.metro.tracts.int) +
tm_polygons(col = "blue") +
tm_shape(sac.metro) +
tm_borders(col = "red")

We have one small issue. Using join = st_intersects
returns all tracts that intersect sac.metro, including
those that merely touch the metro’s boundary. No bueno. We can instead use the
argument join = st_within to return tracts that are
completely within the metro area.
sac.metro.tracts.w <- st_join(ca.tracts, sac.metro, join = st_within, left=FALSE)
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "blue") +
tm_shape(sac.metro) +
tm_borders(col = "red")

Looking much better! Now, if we look at sac.metro.tracts.w’s
attribute table, you’ll see it includes all the variables from both
ca.tracts and sac.metro. We don’t need these
variables, so use select() to eliminate them. You’ll also
notice that if variables from two data sets share the same name, R will
keep both and attach .x and .y to the end. For
example, GEOID was found in both ca.tracts and sac.metro,
so R named one GEOID.x and the one that was merged in was
named GEOID.y.
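As a sketch (the exact column names depend on your join, so check names(sac.metro.tracts.w) and adjust accordingly), we could keep just a handful of tract-level variables:

```r
# Illustrative clean-up: keep only the tract-level columns and
# rename the duplicated GEOID. The geometry column is "sticky",
# so select() carries it along automatically.
sac.tracts.clean <- sac.metro.tracts.w %>%
  select(GEOID = GEOID.x, Tract, County, State, medincE)
```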
OK, now that we’ve talked a little about how to bring in and manipulate vector polygon data, let’s do some mapping and create some choropleth maps. We can do this with the ggplot package, the tmap package, or the leaflet package (which we won’t cover now, but it’s very cool for interactive maps). Let’s start with ggplot.
Because sf is tidy friendly, it is no surprise we
can use the tidyverse plotting function
ggplot() to make maps. We already received an introduction
to ggplot() in Lab 2. Recall
its basic structure:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(x, y)) +
<OPTIONS>()
In mapping, geom_sf() is
<GEOM_FUNCTION>(). Unlike with functions like
geom_histogram() and geom_boxplot(), we don’t
specify an x and y axis. Instead, you use fill to shade features by a
variable, or color to control just the boundaries.
Let’s use ggplot() to make a choropleth map. We need to
specify a numeric variable in the fill = argument within
geom_sf(). Here we map tract-level median household income
in the Sacramento metro area.
ggplot(data = sac.metro.tracts.w) +
geom_sf(aes(fill = medincE))

We can also specify a title (as well as subtitles and captions) using
the labs() function.
ggplot(data = sac.metro.tracts.w) +
geom_sf(aes(fill = medincE)) +
labs(title = "Median Income Sacramento MSA Tracts")

We can make further layout adjustments to the map. Don’t like a blue
scale on the legend? You can change it using the
scale_fill_gradient() function. Let’s change it to a white-to-red
gradient by setting low = "white" and high = "red". We can also
eliminate the gray tract border colors to make the fill color
distinction clearer by specifying color = NA inside
geom_sf(). We can get rid of the gray background by specifying a basic
black-and-white theme using theme_bw(). Finally, we add a caption
indicating the source of the data using the caption = parameter
within labs(), and we name the legend with name =.
ggplot(data = sac.metro.tracts.w) +
geom_sf(aes(fill = medincE), color = NA) +
scale_fill_gradient(low= "white", high = "red", na.value ="gray", name = "Median Income") +
labs(title = "Median Income Sacramento MSA Tracts",
caption = "Source: American Community Survey") +
theme_bw()

Dare I say, we are ready for the New York Times with this map!
OK, now that we have mapped points and mapped polygons, let’s put them both together! First, we are going to bring in our old friend CAdata from last week’s lab.
data(CAdata)
ca_pts <- CAdata
summary(ca_pts)
## time event X Y
## Min. : 0.004068 Min. :0.0000 Min. :1811375 Min. :-241999
## 1st Qu.: 1.931247 1st Qu.:0.0000 1st Qu.:2018363 1st Qu.: -94700
## Median : 4.749980 Median :1.0000 Median :2325084 Median : -60386
## Mean : 6.496130 Mean :0.6062 Mean :2230219 Mean : 87591
## 3rd Qu.: 9.609031 3rd Qu.:1.0000 3rd Qu.:2380230 3rd Qu.: 318280
## Max. :24.997764 Max. :1.0000 Max. :2705633 Max. : 770658
## AGE INS
## Min. :25.00 Mcd: 431
## 1st Qu.:53.00 Mcr:1419
## Median :62.00 Mng:2304
## Mean :61.28 Oth: 526
## 3rd Qu.:71.00 Uni: 168
## Max. :80.00 Unk: 152
ca_pts <- st_as_sf(CAdata, coords=c("X","Y"))
ca_proj <- "+proj=lcc +lat_1=40 +lat_2=41.66666666666666
+lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
+y_0=500000.0000000002 +ellps=GRS80
+datum=NAD83 +units=m +no_defs"
#Set CRS
ca_pts_crs <- st_set_crs(ca_pts, ca_proj)
This time, we will map those points with ggplot.
ggplot(data = ca_pts_crs) +
geom_sf(color = "black") +
labs(title = "Study Participants",
caption = "Source: Ovarian Cancer Cases") +
theme_bw()

We can overlay the points over Sacramento tracts to give the
locations some perspective. Here, you add two geom_sf()
arguments for the tracts and the cancer cases.
ggplot() +
geom_sf(data = sac.metro.tracts.w) +
geom_sf(data = ca_pts_crs, color = "black") +
labs(title = "Study Participants",
caption = "Source: Ovarian Cancer Cases") +
theme_bw()

Hmmm. That doesn’t look great. We have lots of cases outside of Sacramento. Let’s filter to just the cases within the Sacramento area.
ca_pts_crs.w <- st_join(ca_pts_crs, sac.metro.tracts.w, join = st_within, left=FALSE)
Ooof. That doesn’t work. It says our
st_crs(x) == st_crs(y) is not TRUE. That means our
Coordinate Reference Systems are not matching! Let’s transform our
cancer dataset ca_pts_crs to match the CRS of
sac.metro.tracts.w with one easy step using
st_transform():
#check crs of each dataset
st_crs(ca_pts_crs)
## Coordinate Reference System:
## User input: +proj=lcc +lat_1=40 +lat_2=41.66666666666666
## +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000
## +y_0=500000.0000000002 +ellps=GRS80
## +datum=NAD83 +units=m +no_defs
## wkt:
## PROJCRS["unknown",
## BASEGEOGCRS["unknown",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]]],
## CONVERSION["unknown",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",39.3333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-122,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",40,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",41.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",2000000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
st_crs(sac.metro.tracts.w)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
#create new dataset with transformed CRS
ca_pts_crs.transformed <- st_transform(ca_pts_crs, st_crs(sac.metro.tracts.w))
st_crs(ca_pts_crs.transformed)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
OK, let’s try this again!
ca_pts_crs.w <- st_join(ca_pts_crs.transformed, sac.metro.tracts.w, join = st_within, left=FALSE)
It worked! OK, now let’s try our map again.
ggplot() +
geom_sf(data = sac.metro.tracts.w) +
geom_sf(data = ca_pts_crs.w, color = "black") +
labs(title = "Study Participants",
caption = "Source: Ovarian Cancer Cases") +
theme_bw()

Alright, who’s ready for a challenge? Let’s put it all together in one nice map.
ggplot() +
geom_sf(data = sac.metro.tracts.w, aes(fill = medincE), color = NA) +
scale_fill_gradient(low= "white", high = "red", na.value ="gray", name = "Median Income") +
geom_sf(data = ca_pts_crs.w, color = "black") +
labs(title = "Study Participants Overlaid with Median Income of Sacramento Tracts",
caption = "Source: American Community Survey and Ovarian Cancer Cases") +
theme_bw()

Can I just say, you’re very impressive. Well done!
Whether you prefer tmap or ggplot is up to you, but I find that tmap has some benefits, so let’s focus on its mapping functions next.
tmap uses the same layered logic as
ggplot. As we saw last week, the initial command is
tm_shape(), which specifies the geography to which the
mapping is applied. You then build on tm_shape() by adding
one or more elements such as tm_polygons() for polygons,
tm_borders() for lines, and tm_dots() for
points. All additional layer functions start with the prefix tm_.
Check the full list of tm_ elements here.
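For example, layers stack just as in ggplot, with each dataset getting its own tm_shape() call. A sketch using the tract and point objects created earlier in this lab (tmap v3 syntax, matching the rest of the lab):

```r
# Polygons first, then dots layered on top
tm_shape(sac.metro.tracts.w) +
  tm_polygons() +
  tm_shape(ca_pts_crs.w) +
  tm_dots(col = "black")
```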
Let’s make a static choropleth map of median household income in Sacramento MSA just like we did above, but this time in tmap.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style' to 'tm_scale_intervals(<HERE>)'

We first put the dataset sac.metro.tracts.w inside
tm_shape(). Because you are plotting polygons, you use
tm_polygons() next. The argument
col = "medincE" tells R to shade (or color) the tracts by
the variable medincE. The argument
style = "quantile" tells R to break the shading into
quantiles (five equal-sized groups by default). I find that this is where
tmap offers a distinct advantage over
ggplot in that users have greater control over the
legend and bin breaks. tmap allows users to specify
algorithms to automatically create breaks with the style argument. You
can also change the number of breaks by setting n=. The
default is n=5. Rather than quintiles, you can show
quartiles using n=4. I’m feeling crazy. Let’s do it.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile", n=4)
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n' to 'tm_scale_intervals(<HERE>)'

Check out this link for more on available classification styles in tmap.
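For instance, swapping the style argument changes how the bin breaks are computed. A couple of common alternatives (same tmap v3 syntax as above):

```r
# Natural breaks (Jenks) instead of quantiles
tm_shape(sac.metro.tracts.w) +
  tm_polygons(col = "medincE", style = "jenks")

# Equal-width intervals
tm_shape(sac.metro.tracts.w) +
  tm_polygons(col = "medincE", style = "equal")
```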
The tm_polygons() command is a wrapper around two other
functions, tm_fill() and tm_borders().
tm_fill() controls the contents of the polygons (color,
classification, etc.), while tm_borders() does the same for
the polygon outlines.
For example, using the same shape (but no variable), we obtain the
outlines of the neighborhoods from the tm_borders()
command.
tm_shape(sac.metro.tracts.w) +
tm_borders()

Similarly, we obtain a choropleth map without the polygon outlines
when we just use the tm_fill() command.
tm_shape(sac.metro.tracts.w) +
tm_fill("medincE")

When we combine the two commands, we obtain the same map as with tm_polygons() (this illustrates how in R one can often obtain the same result in a number of different ways). Try this on your own.
The argument palette = defines the color ranges
associated with the bins and determined by the style
arguments. Several built-in palettes are contained in
tmap. For example, using palette = "Reds"
would yield the following map for our example.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

Under the hood, “Reds” refers to one of the color
schemes supported by the RColorBrewer package (see
below).
In addition to the built-in palettes, customized color ranges can be
created by specifying a vector with the desired colors as anchors. This
will create a spectrum of colors in the map that range between the
colors specified in the vector. For instance, if we used
c("red", "blue"), the color spectrum would move from red through
purple to blue, with shades in between. In our example:
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = c("red","blue"))
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'

Not exactly a pretty picture. In order to capture a diverging scale,
we insert “white” in between red and blue.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = c("red","white", "blue"))
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'

A preferred approach to selecting a color palette is to choose one of the schemes contained in the RColorBrewer package. These are based on the research of cartographer Cynthia Brewer (see the colorbrewer2 website for details). ColorBrewer makes a distinction between sequential scales (for a scale that goes from low to high), diverging scales (to highlight how values differ from a central tendency), and qualitative scales (for categorical variables). For each scale, a series of single-hue and multi-hue palettes are suggested. In the RColorBrewer package, these are referred to by name (e.g., the “Reds” palette we used above). The full list is contained in the RColorBrewer documentation.
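RColorBrewer also ships a lookup table, brewer.pal.info, that lists every palette along with its category and maximum number of colors, which is handy for browsing the three scale types:

```r
library(RColorBrewer)
# Each row is a palette; category is "seq" (sequential),
# "div" (diverging), or "qual" (qualitative)
head(brewer.pal.info)
table(brewer.pal.info$category)
```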
There are two very useful commands in this package. One sets a color palette by specifying its name and the number of desired categories. The result is a character vector with the hex codes of the corresponding colors.
For example, we select a sequential color scheme going from blue to
green, as BuGn, by means of the command
brewer.pal, with the number of categories (6) and the
scheme as arguments. The resulting vector contains the HEX codes for the
colors.
library(RColorBrewer) #brewer.pal() comes from this package
brewer.pal(6, "BuGn")
## [1] "#EDF8FB" "#CCECE6" "#99D8C9" "#66C2A4" "#2CA25F" "#006D2C"
Using this palette in our map yields the following result.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette="BuGn")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "BuGn" is named
## "brewer.bu_gn"
## Multiple palettes called "bu_gn" found: "brewer.bu_gn", "matplotlib.bu_gn". The first one, "brewer.bu_gn", is returned.

The command display.brewer.pal() allows us to explore
different color schemes before applying them to a map. For example:
display.brewer.pal(6,"BuGn")

There are many options to change the formatting of the legend. The
automatic title for the legend is not that attractive, since it is
simply the variable name. This can be customized by setting the
title argument.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

Another important aspect of the legend is its positioning. This is
handled through the tm_layout() function. This function has
a vast number of options, as detailed in the documentation.
There are also specialized subsets of layout functions, focused on
specific aspects of the map, such as tm_legend(),
tm_style() and tm_format(). We illustrate the
positioning of the legend.
Often, the default location of the legend is appropriate, but
sometimes further control is needed. The legend.position
argument to the tm_layout function moves the legend around
the map, and it takes a vector of two string variables that determine
both the horizontal position (“left”, “right”, or “center”) and the
vertical position (“top”, “bottom”, or “center”).
For example, if we would want to move the legend to the bottom-right position, we would use the following set of commands.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_layout(legend.position = c("right", "bottom"))
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

There is also the option to position the legend outside the frame of
the map. This is accomplished by setting legend.outside to
TRUE, and optionally also specifying its position by means of the
legend.outside.position argument. The latter can take the values
“top”, “bottom”, “right”, and “left”.
For example, positioning the legend outside and on the right is accomplished by the following commands.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

We can also customize the size of the legend, its alignment, font, etc. Check out the documentation for more!
Another functionality of the tm_layout() function is to
set a title for the map, and specify its position, size, etc. For
example, we can set the title, the title.size and the
title.position as in the example below. We made the font
size a bit smaller (0.8) in order not to overwhelm the map, and
positioned it in the top left-hand corner.
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_layout(title = "Median Income of Sacramento Tracts", title.size = 0.8,
title.position = c("left","top"),
legend.outside = TRUE, legend.outside.position = "right")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

To have a title appear on top (or on the bottom) of the map, we need
to set the main.title argument of the
tm_layout() function, with the associated
main.title.position, as illustrated below (with main.title.size
set to 1.25 to have a larger font).
tm_shape(sac.metro.tracts.w) +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_layout(main.title = "Median Income of Sacramento Tracts",
main.title.size = 1.25, main.title.position="center",
legend.outside = TRUE, legend.outside.position = "right")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

OK, this really wouldn’t be a GIS class without talking about one of
the core elements of a map: the good ole scale bar and north arrow. Let’s add
these to our map. First, we add the scale bar with
tm_scale_bar().
The argument breaks tells R the distances to break up
and end the bar. The argument position places the scale bar
on the bottom left part of the map. Note that the scale is
in miles (we’re in Amurica!). The default is in kilometers (the rest of
the world!), but you can specify the units within
tm_shape() using the argument unit.
text.size scales the size of the bar smaller (below 1) or
larger (above 1).
tm_shape(sac.metro.tracts.w, unit = "mi") +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_scale_bar(breaks = c(0, 5, 10, 20), text.size = 0.75, position = c("left", "bottom")) +
tm_layout(main.title = "Median Income of Sacramento Tracts",
main.title.size = 1.25, main.title.position="center",
legend.outside = TRUE, legend.outside.position = "right")
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'palette' (rename to 'values') to
## 'tm_scale_intervals(<HERE>)'
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## ! `tm_scale_bar()` is deprecated. Please use `tm_scalebar()` instead.
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Reds" is named
## "brewer.reds"
## Multiple palettes called "reds" found: "brewer.reds", "matplotlib.reds". The first one, "brewer.reds", is returned.

Next let’s spice things up by adding a north arrow, which we can do
using the function tm_compass(). You can control the
type, size and location of the arrow within this function. I place a
4-star arrow on the bottom right of the map.
tm_shape(sac.metro.tracts.w, unit = "mi") +
tm_polygons(col = "medincE", style = "quantile",palette = "Reds",
title = "Median Income") +
tm_scale_bar(breaks = c(0, 5, 10, 20), text.size = 0.75,
position = c("left", "bottom")) +
tm_compass(type = "4star", position = c("right", "bottom")) +
tm_layout(main.title = "Median Income of Sacramento Tracts",
main.title.size = 1.25, main.title.position="center",
legend.outside = TRUE, legend.outside.position = "right")

We can also eliminate the frame around the map using the argument
frame = FALSE.
sac.map <- tm_shape(sac.metro.tracts.w, unit = "mi") +
  tm_polygons(col = "medincE", style = "quantile", palette = "Reds",
              title = "Median Income") +
  tm_scale_bar(breaks = c(0, 5, 10, 20), text.size = 0.75,
               position = c("left", "bottom")) +
  tm_compass(type = "4star", position = c("right", "bottom")) +
  tm_layout(main.title = "Median Income of Sacramento tracts",
            main.title.size = 1.25, frame = FALSE,
            main.title.position = "center",
            legend.outside = TRUE, legend.outside.position = "right")
sac.map

Note that I saved the map into an object called sac.map. R
is an object-oriented language, so everything you make in R is an object
that can be saved for future manipulation. This includes maps. Future
manipulations of a saved map include adding more
tm_* functions to the saved object, such as
sac.map + tm_layout(your changes here). Check the help
documentation for tm_layout() to see the complete list of
settings.
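As a self-contained sketch of this add-to-a-saved-map pattern (using tmap's built-in World dataset rather than our Sacramento tracts):

```r
library(tmap)

data(World)  # example dataset that ships with tmap

# Save a basic map into an object...
world.map <- tm_shape(World) +
  tm_polygons(col = "HPI")

# ...then modify the saved object later by adding more tm_* functions,
# here dropping the frame around the map
world.map + tm_layout(frame = FALSE)
```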
You can save your maps a few ways:
1. On the plotting screen where the map is shown, click on Export and save it as either an image or PDF file.
2. Use the function tmap_save().
For option 2, we can save the map object sac.map as such:
tmap_save(sac.map, "sac_city_inc.jpg")
## Map saved to /Users/pjames1/Dropbox/UC Davis Folders/SPH 215 GIS and Public Health/Github_Website/SPH215/sac_city_inc.jpg
##
## Resolution: 2598.137 by 1697.37 pixels
##
## Size: 8.660457 by 5.6579 inches (300 dpi)
Specify the tmap object and a filename with an extension. Supported formats include .pdf, .eps, .svg, .wmf, .png, .jpg, .bmp, and .tiff; the default is .png. Also make sure you’ve set your working directory to the folder you want your map saved in.
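You can also control the output size directly with the width, height, units, and dpi arguments. A quick sketch (the filenames here are just examples):

```r
# Save as a PDF at a specific physical size (8 x 6 inches)
tmap_save(sac.map, "sac_city_inc.pdf", width = 8, height = 6, units = "in")

# Or as a high-resolution PNG, e.g., for a publication-quality figure
tmap_save(sac.map, "sac_city_inc.png", width = 8, height = 6,
          units = "in", dpi = 600)
```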
OK, do we have energy for one more example? Let’s bring in data from the CDC PLACES dataset. This is an incredible resource for accessing data from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS), as well as social determinants of health data from the American Community Survey. I’ve already downloaded California Census tract data on the “% of adults reporting no leisure-time physical activity”. Let’s bring this in and take a look at it!
places_ca_lpa <- read_csv("/Users/pjames1/Dropbox/UC Davis Folders/SPH 215 GIS and Public Health/Github_Website/SPH215/places_ca_lpa.csv")
## Rows: 9070 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl (6): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(places_ca_lpa)
## Rows: 9,070
## Columns: 24
## $ Year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
## $ StateAbbr <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "…
## $ StateDesc <chr> "California", "California", "California", "…
## $ CountyName <chr> "Orange", "Riverside", "Riverside", "Sacram…
## $ CountyFIPS <chr> "06059", "06065", "06065", "06067", "06073"…
## $ LocationName <chr> "06059063010", "06065042509", "06065042519"…
## $ DataSource <chr> "BRFSS", "BRFSS", "BRFSS", "BRFSS", "BRFSS"…
## $ Category <chr> "Health Risk Behaviors", "Health Risk Behav…
## $ Measure <chr> "No leisure-time physical activity among ad…
## $ Data_Value_Unit <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%"…
## $ Data_Value_Type <chr> "Crude prevalence", "Crude prevalence", "Cr…
## $ Data_Value <dbl> 13.1, 33.6, 40.2, 26.6, 32.9, 12.9, 18.7, 2…
## $ Data_Value_Footnote_Symbol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Data_Value_Footnote <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Low_Confidence_Limit <dbl> 11.3, 30.0, 36.3, 23.4, 29.6, 11.2, 16.4, 1…
## $ High_Confidence_Limit <dbl> 15.0, 37.4, 44.2, 29.9, 36.3, 14.7, 21.1, 2…
## $ TotalPopulation <dbl> 6698, 3460, 1887, 7420, 3601, 4831, 4778, 3…
## $ TotalPop18plus <dbl> 5354, 2491, 1297, 5680, 2534, 4017, 3684, 2…
## $ Geolocation <chr> "POINT (-117.8983938 33.6281163)", "POINT (…
## $ LocationID <chr> "06059063010", "06065042509", "06065042519"…
## $ CategoryID <chr> "RISKBEH", "RISKBEH", "RISKBEH", "RISKBEH",…
## $ MeasureId <chr> "LPA", "LPA", "LPA", "LPA", "LPA", "LPA", "…
## $ DataValueTypeID <chr> "CrdPrv", "CrdPrv", "CrdPrv", "CrdPrv", "Cr…
## $ Short_Question_Text <chr> "Physical Inactivity", "Physical Inactivity…
Interesting stuff. Looks like the values we care about are stored in a column called Data_Value, and the FIPS code seems to be in LocationName. Let’s go ahead and rename LocationName and then see if we can join this data with our Census tract data ca.tracts.
places_ca_lpa <- rename(places_ca_lpa, GEOID = LocationName)
glimpse(places_ca_lpa)
## Rows: 9,070
## Columns: 24
## $ Year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
## $ StateAbbr <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "…
## $ StateDesc <chr> "California", "California", "California", "…
## $ CountyName <chr> "Orange", "Riverside", "Riverside", "Sacram…
## $ CountyFIPS <chr> "06059", "06065", "06065", "06067", "06073"…
## $ GEOID <chr> "06059063010", "06065042509", "06065042519"…
## $ DataSource <chr> "BRFSS", "BRFSS", "BRFSS", "BRFSS", "BRFSS"…
## $ Category <chr> "Health Risk Behaviors", "Health Risk Behav…
## $ Measure <chr> "No leisure-time physical activity among ad…
## $ Data_Value_Unit <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%"…
## $ Data_Value_Type <chr> "Crude prevalence", "Crude prevalence", "Cr…
## $ Data_Value <dbl> 13.1, 33.6, 40.2, 26.6, 32.9, 12.9, 18.7, 2…
## $ Data_Value_Footnote_Symbol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Data_Value_Footnote <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Low_Confidence_Limit <dbl> 11.3, 30.0, 36.3, 23.4, 29.6, 11.2, 16.4, 1…
## $ High_Confidence_Limit <dbl> 15.0, 37.4, 44.2, 29.9, 36.3, 14.7, 21.1, 2…
## $ TotalPopulation <dbl> 6698, 3460, 1887, 7420, 3601, 4831, 4778, 3…
## $ TotalPop18plus <dbl> 5354, 2491, 1297, 5680, 2534, 4017, 3684, 2…
## $ Geolocation <chr> "POINT (-117.8983938 33.6281163)", "POINT (…
## $ LocationID <chr> "06059063010", "06065042509", "06065042519"…
## $ CategoryID <chr> "RISKBEH", "RISKBEH", "RISKBEH", "RISKBEH",…
## $ MeasureId <chr> "LPA", "LPA", "LPA", "LPA", "LPA", "LPA", "…
## $ DataValueTypeID <chr> "CrdPrv", "CrdPrv", "CrdPrv", "CrdPrv", "Cr…
## $ Short_Question_Text <chr> "Physical Inactivity", "Physical Inactivity…
ca.tracts.lpa <- ca.tracts %>%
  left_join(places_ca_lpa, by = "GEOID")
glimpse(ca.tracts.lpa)
## Rows: 9,129
## Columns: 41
## $ GEOID <chr> "06001442700", "06001442800", "06037204920"…
## $ Tract <chr> "Census Tract 4427", "Census Tract 4428", "…
## $ County <chr> "Alameda County", "Alameda County", "Los An…
## $ State <chr> "California", "California", "California", "…
## $ tpoprE <dbl> 2998, 3087, 2459, 3690, 3402, 3409, 3490, 8…
## $ tpoprM <dbl> 475, 386, 387, 446, 438, 624, 605, 350, 509…
## $ nhwhiteE <dbl> 921, 779, 68, 14, 236, 15, 500, 1379, 138, …
## $ nhwhiteM <dbl> 168, 183, 33, 14, 131, 24, 264, 165, 49, 83…
## $ nhblkE <dbl> 119, 12, 49, 8, 65, 8, 17, 2633, 18, 18, 93…
## $ nhblkM <dbl> 115, 49, 51, 12, 93, 10, 75, 344, 22, 19, 9…
## $ nhasnE <dbl> 1577, 1671, 0, 0, 123, 70, 1030, 238, 151, …
## $ nhasnM <dbl> 469, 356, 14, 14, 164, 104, 235, 76, 82, 56…
## $ hispE <dbl> 308, 561, 2331, 3642, 2978, 3303, 1919, 354…
## $ hispM <dbl> 81, 181, 387, 446, 481, 628, 519, 285, 519,…
## $ NAME <chr> "Census Tract 4427; Alameda County; Califor…
## $ medincE <dbl> 199154, 180800, 70500, 52262, 110967, 28516…
## $ medincM <dbl> 30525, 28293, 15698, 5659, 22149, 3246, 285…
## $ Year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
## $ StateAbbr <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "…
## $ StateDesc <chr> "California", "California", "California", "…
## $ CountyName <chr> "Alameda", "Alameda", "Los Angeles", "Los A…
## $ CountyFIPS <chr> "06001", "06001", "06037", "06037", "06037"…
## $ DataSource <chr> "BRFSS", "BRFSS", "BRFSS", "BRFSS", "BRFSS"…
## $ Category <chr> "Health Risk Behaviors", "Health Risk Behav…
## $ Measure <chr> "No leisure-time physical activity among ad…
## $ Data_Value_Unit <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%"…
## $ Data_Value_Type <chr> "Crude prevalence", "Crude prevalence", "Cr…
## $ Data_Value <dbl> 14.2, 18.8, 38.2, 38.7, 25.3, 40.2, 30.7, 2…
## $ Data_Value_Footnote_Symbol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Data_Value_Footnote <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Low_Confidence_Limit <dbl> 12.2, 16.3, 35.2, 35.7, 22.9, 37.3, 28.1, 2…
## $ High_Confidence_Limit <dbl> 16.6, 21.6, 41.1, 41.7, 27.5, 43.1, 33.2, 2…
## $ TotalPopulation <dbl> 3141, 2959, 2537, 3850, 3632, 3858, 3335, 5…
## $ TotalPop18plus <dbl> 2432, 2359, 1903, 2599, 2822, 2783, 2695, 5…
## $ Geolocation <chr> "POINT (-122.0081094 37.5371514)", "POINT (…
## $ LocationID <chr> "06001442700", "06001442800", "06037204920"…
## $ CategoryID <chr> "RISKBEH", "RISKBEH", "RISKBEH", "RISKBEH",…
## $ MeasureId <chr> "LPA", "LPA", "LPA", "LPA", "LPA", "LPA", "…
## $ DataValueTypeID <chr> "CrdPrv", "CrdPrv", "CrdPrv", "CrdPrv", "Cr…
## $ Short_Question_Text <chr> "Physical Inactivity", "Physical Inactivity…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-122.0172 3...…
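Notice the joined data has 9,129 rows, while the PLACES file only had 9,070: with left_join(), every tract in ca.tracts keeps its row, and tracts with no PLACES match simply get NA for the PLACES columns. A toy sketch of that behavior, using two real GEOIDs from our data plus one made-up tract ID and income value:

```r
library(dplyr)

# Two GEOIDs that appear in our data, plus one made-up tract with no match
tracts <- tibble(GEOID   = c("06001442700", "06001442800", "06099999999"),
                 medincE = c(199154, 180800, 55000))
lpa    <- tibble(GEOID      = c("06001442700", "06001442800"),
                 Data_Value = c(14.2, 18.8))

left_join(tracts, lpa, by = "GEOID")
# The third tract keeps its row but gets NA for Data_Value
```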
It worked! OK, now let’s make a map of the % of adults reporting no
leisure-time physical activity, building on what we’ve learned
above! I’ve also included the option lwd = 0 in
tm_polygons(), which sets the border width to 0. With no
tract borders, it’s much easier to see polygon values where Census
tracts are densely concentrated (e.g., around Sacramento, San
Francisco, and Los Angeles).
tm_shape(ca.tracts.lpa, unit = "mi") +
  tm_polygons(col = "Data_Value", style = "quantile", palette = "Reds",
              title = "% Adults No Physical Activity", lwd = 0) +
  tm_layout(main.title = "% of Adults Reporting No Leisure Time Physical Activity",
            main.title.size = 1.25, main.title.position = "center")
##
## [plot mode] fit legend/component: Some legend items or map compoments do not
## fit well, and are therefore rescaled.
## ℹ Set the tmap option `component.autoscale = FALSE` to disable rescaling.

Oooooooo that is one goooood looking map!
You’ve completed your introduction to sf. Whew! Badge? Yes, please, you earned it! Time to celebrate!